Heart Disease Prediction AI


My Role

Data Scientist & ML Engineer – Predictive Pipeline Development

  • Advanced Data Cleaning: Designing multi-stage regex-based sanitization of raw clinical records
  • Target Variable Engineering: Re-mapping multi-class medical labels into a binary classification target
  • Statistical Imputation: Implementing median-based imputation for missing values
  • Feature Engineering: Applying One-Hot Encoding and Standard Scaling
  • Model Implementation: Developing and fine-tuning a Logistic Regression classifier
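
The regex cleaning step described above can be sketched as follows. This is a hypothetical illustration: the exact junk patterns in the raw file (null bytes, standalone hyphens, `?` placeholders) and the `clean_cell` helper are assumptions, not the project's actual code.

```python
import re

import numpy as np
import pandas as pd


def clean_cell(value):
    """Normalize a raw cell: strip junk characters and map
    placeholder tokens (standalone hyphen, '?', empty) to NaN."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    # Remove null bytes and other non-printable control characters
    text = re.sub(r"[\x00-\x1f]", "", text)
    # Treat standalone hyphens, '?', and empty strings as missing
    if re.fullmatch(r"-|\?|", text):
        return np.nan
    return text


raw = pd.Series(["63.0", " - ", "?", "145\x00"])
print(raw.map(clean_cell).tolist())  # ['63.0', nan, nan, '145']
```

After this pass, the remaining string values can be safely cast to numeric types with `pd.to_numeric`.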

Project Highlights

  • Real-World Data Handling: Successfully navigated a messy clinical dataset containing standalone hyphens and other invalid patterns
  • High-Fidelity Preprocessing: Produced a clean dataset from source data with over 70% missing values
  • Scalable Workflow: Pipeline designed for easy testing of other models (Random Forests, SVMs)
  • Clinical Relevance: Focused on medical features like thalach and chest pain types
  • Interpretability-First Design: Selected Logistic Regression for clinician-friendly results

Heart Disease Prediction AI is a clinical data science project that utilizes the UCI Cleveland Heart Disease dataset to predict the presence of cardiovascular conditions in patients. The project is centered on converting complex medical parameters—such as cholesterol levels, blood pressure, and ECG results—into actionable binary classifications.

I developed this system to demonstrate a robust data preprocessing pipeline that handles non-standard missing values, anomalous data entries, and feature engineering for medical diagnostics, providing a high-interpretability model suitable for clinical applications.
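
The conversion to a binary diagnosis can be shown in a short sketch. In the UCI Cleveland data the `num` column records disease severity from 0 to 4; any value above 0 indicates some degree of heart disease, so the multi-class label collapses to a single binary target (the column name `target` is my choice for illustration):

```python
import pandas as pd

# Severity 0 = no disease; 1-4 = increasing disease severity
df = pd.DataFrame({"num": [0, 1, 2, 3, 4, 0]})

# Collapse severities 1-4 into a single positive class
df["target"] = (df["num"] > 0).astype(int)

print(df["target"].tolist())  # [0, 1, 1, 1, 1, 0]
```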

The project follows a systematic medical ML pipeline:

  1. Data Cleaning: Regex-based sanitization of junk characters and null bytes
  2. Target Engineering: Mapping multi-class labels (0-4) to binary diagnosis
  3. Missing Data Handling: Median-based imputation for physiological measurements
  4. Feature Processing: One-Hot Encoding for categorical variables and Standard Scaling
  5. Model Selection: Logistic Regression with liblinear solver for interpretability
  6. Clinical Validation: Focus on medically relevant features and results
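
Steps 3–5 of the pipeline above can be assembled with scikit-learn's `Pipeline` and `ColumnTransformer`. This is a minimal sketch, not the project's actual code: the split of Cleveland columns into numeric and categorical features is my assumption based on the dataset's published schema.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature split for the Cleveland schema
numeric = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

preprocess = ColumnTransformer([
    # Median imputation, then standard scaling, for physiological measurements
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # One-hot encode categorical clinical variables
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    # liblinear suits small datasets and keeps coefficients interpretable
    ("clf", LogisticRegression(solver="liblinear")),
])
```

Because preprocessing lives inside the pipeline, swapping the final estimator for a Random Forest or SVM is a one-line change, which is what makes the workflow easy to extend.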

Technologies Used

  • Python 3 – Primary language for data analysis
  • Scikit-Learn – Model training and scaling
  • Pandas & NumPy – Data loading and numerical computation
  • Regex (re) – Data sanitization
  • StandardScaler – Feature normalization
  • UCI Cleveland Dataset – Medical data source
  • Jupyter Notebook – Development environment
  • Matplotlib/Seaborn – Data visualization

Key Features

  • Resilient Data Loading with custom whitespace handling
  • Binary Diagnostic Mapping from complex severities
  • Automated Feature Scaling for medical variables
  • Advanced Regex cleaning for clinical data
  • Interpretability-First Model Design
  • Statistical Imputation Methods
  • Clinical Feature Engineering
  • Scalable Medical Pipeline

Project Impact

  • Clinical Applicability: System provides actionable binary classifications for medical diagnosis
  • Data Resilience: Successfully handles messy real-world medical datasets with high missing values
  • Interpretable Results: Logistic Regression model offers transparency for clinical decision-making
  • Scalable Solution: Pipeline architecture adaptable to other medical prediction tasks
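
To illustrate the interpretability claim: with standardized inputs, each Logistic Regression coefficient is the change in log-odds of disease per one standard deviation of that feature, and exponentiating it gives an odds ratio clinicians can read directly. The data below is synthetic and the labels are a toy construction purely for demonstration; only the feature names follow the Cleveland schema.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "thalach": rng.normal(150, 20, 200),  # max heart rate achieved
    "chol": rng.normal(240, 40, 200),     # serum cholesterol
})
# Toy label: lower max heart rate marked as diseased (for illustration only)
y = (X["thalach"] < 150).astype(int)

clf = LogisticRegression(solver="liblinear")
clf.fit(StandardScaler().fit_transform(X), y)

for name, coef in zip(X.columns, clf.coef_[0]):
    print(f"{name}: log-odds weight {coef:+.2f}, odds ratio {np.exp(coef):.2f}")
```

A negative weight on `thalach` here reads as "higher max heart rate lowers the predicted odds of disease", which is the kind of transparent statement a black-box model cannot offer.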